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Abstract —Text image super-resolution is a challenging yet 
open research problem in the computer vision community. 
In particular, low-resolution images hamper the performance 
of typical optical character recognition (OCR) systems. In 
this article, we summarize our entry to the ICDAR2015 
Competition on Text Image Super-Resolution, Experiments are 
based on the provided ICDAR2015 TextSR dataset [3] and the 
released Tesseract-OCR 3.02 system [1]. We report that our 
winning entry of text image super-resolution framework has 
largely improved the OCR performance with low-resolution 
images used as input, reaching an OCR accuracy score of 
77.19%, which is comparable with that of using the original 
high-resolution images (78.80%). 

Index Terms —super resolution; optical character recogni¬ 
tion. 


1. Introduction 

O PTICAL Character Recognition systems aim at con¬ 
verting textual images to machine-encoded text. By 
mimicking the human reading process, OCR systems enable 
a machine to understand the text information in images 
and recognize specific alphanumeric words, text phrases 
or sentences. Yet, due to the complexity of input contents 
and variations caused by various sources, accurately rec¬ 
ognizing optical characters can be challenging for standard 
OCR systems. In particular, OCR systems trained on high- 
resolution (HR) text images may fail on predicting the 
correctly text in elusive low-resolution (LR) text images. 
Specifically, LR text images lack of fine details, making it 
harder for the OCR systems to correctly retrieve the textual 
information from some commonly acquired OCR features. 
As such, performing super-resolution on input images is an 
intuitive pre-processing step towards the goal of accurate 
optical character recognition. 

To perform image super-resolution, unseen pixels need 
to be predicted under a well-constrained solution space 
with strong prior. By recovering the HR image from its 
LR correspondence, pixel values are non-linearly mapped 
during the process. Following the work [5], [6], we 
adopt a Super-Resolution Convolutional Neural Network 
(SRCNN) approach and train several CNNs targeted on text 
images. The text image super-resolution CNN framework 
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demonstrates its effectiveness in recovering HR text images. 
Our main contributions are twofold: 

1) We extend the SRCNN approach to the specific text- 
image domain, and conduct a comprehensive investigation 
on different network settings (e.g. filter size and number of 
layers). 

2) We conduct model combination to further improve 
the performance. In the ICDAR 2015 Competition on Text 
Image Super-Resolution^, we achieve the best results [3], 
which improve the OCR performance by 16.55% in 
accuracy compared with bicubic interpolation. 

11. Methodology 

A. Super-Resolution Convolutional Neural Network 

The Super-Resolution Convolutional Neural Network 
(SRCNN) [5], [6] is primarily designed for general single¬ 
image super-resolution, and can be easily extended to 
specific super-resolution topics (e.g. face hallucination and 
text image super-resolution) or other low-level vision 
problems (e.g.denoising). The basic framework of SRCNN 
consists of three convolutional layers without pooling or 
fully-connected layers. This allows it to accept an LR image 
Y of any size (after interpolation and padding) and directly 
output the HR image F{Y). Each of the three layers is 
responsible for a specific task. The first layer performs 
feature extraction on the input image and represents each 
patch as a high dimensional feature vector. The second layer 
then non-linearly maps these feature vectors to another set 
of feature vectors which are conceptually the representation 
of the HR image patches. The last layer recombines these 
representations and reconstructs the final HR image. The 
process can be expressed as the following equations: 

F,(Y) = max (0, IL, * Y + 5,) % G {1, 2}; (1) 

F(Y) = IL3*F2(Y) + 53. (2) 

where Wi and Bi represent the filters and biases of the 
ith layer respectively, Fi is the output feature maps and 
denotes the convolution operation. The Wi contains Ui 
filters of support x fi x fi, where fi is the spatial 
support of a filter, Ui is the number of filters, and no is the 
number of channels in the input image. Please refer to [6] 
for more details. 

This framework is fiexible at the choice of parameters. 
We can adopt four or more layers with a larger filter size 

^Competition website: http://Iiris.cnrs.fr/icdar-sr20I5/ 
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to further improve the performance. For example, in [4], 
a feature enhancement layer is added after the feature 
extraction layer to “denoise” the extracted features. In the 
text image domain, we also attempt to use deeper networks 
to further improve performance. 

B. Model combination 

Model combination has been widely used in high-level 
vision problems (e.g. object detection [ 8 ]). In SRCNN, 
each output pixel can be regarded as a prediction of a 
pixel value, thus averaging predictions of different networks 
could inherently improve the accuracy. Further, as a training 
based method, the testing results may vary due to different 
check points and initial values. With model combination, 
this effect is neutralized, and the results will be more stable 
and reliable. 

We adopt a “Greedy Search” strategy to find the best 
model combination. Suppose we have successfully trained 
n models. First, we select the model that achieves the 
highest PSNR (or OCR score) on the validation set. 
Then we combine its results with that of the n models 
successively, and get n 2-model combinations. Here, we 
simply average their output pixel values. The combination 
with the highest PSNR (or OCR score) is saved as the 
best 2-model combination. We keep these two models 
unchanged and find another model from all n models for 
a second round combination. Similarly, we can identify 
the best m-model combination using these n models. Note 
that each model can be selected more than once, and 
the combination of more models is not necessarily better 
than that of less ones. At last, we choose the best model 
combination from the m selected model combinations. 

III. Experimentst 

Datasets. We use the ICDAR2015 TextSR dataset [3] 
provided by the ICDAR 2015 Competition on Text Image 
Super-Resolution. The dataset includes a training set and a 
test set. The training datset consists of 567 HR-LR gray¬ 
scale image pairs together with the annotation files. HR 
and LR images are downsampled by factors of 2 and 4 
from the original images that are extracted form French 
TV video flux. The height of LR images ranges from 9 
to 29, and the annotation is realized using standard letters, 
numbers and 14 special characters. Figure 1 shows several 
examples of HR-LR image pairs in the training set. The test 
set contains only 141 LR images, without annotation and 
HR images. For more details of the dataset, please refer 
to [3]. To evaluate the performance, we select 30 image 
pairs from the training set as our validation set, thus the 
training set used in our experiments contains the rest 537 
image pairs. 

Implementation Details. First, we upsample all LR 
images by a factor of 2 using Bicubic interpolation. Then 
the HR and LR images are of the same size in the training 
set. In the training phase, the ground truth images and LR 
input samples are prepared as 18 x 18^ -pixel sub-images 

^The number 18 is the minimum height of HR images. 
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Fig. I. Examples of LR-HR image pairs in the training set. 


cropped from the training image pairs. The sub-images are 
extracted from the original images with a stride of 2 in 
vertical and 5 in horizontal. In total, the 537 images provide 
156,941 training samples. Different from [5], [ 6 ], we cannot 
remove all borders during training, because the text images 
are much smaller than generic images. As a compromise, 
we fix the output size to be 14 and pad the input LR images 
according to the network scales. For example, if there are 
three layers, then we need to pad (/i + /2 + /s — 3) — 4 
pixels with zeros. We use the Mean Squared Error (MSE) 
as the loss function, which is evaluated by the difference 
between the central 14 x 14 crop of the ground truth 
image and the network output. The filter weights of each 
layer are initialized by drawing randomly from a Gaussian 
distribution with zero mean and standard deviation 0.001 
(and 0 for biases). The learning rate is 10“^ for the last 
layer and 10“^ for the rest layers. All results are based 
on the check point of 5,000 iterations (about 7.8 x 10^ 
backpropagations). 

A. Super-resolution Results of Single Models 

To find the optimal network settings, we conduct four 
sets of experiments with different filter sizes, number of 
filters, number of layers and initial values. 

1) Filter size: First, we adopt a general three-layer 
network structure and change the filter size. The basic 
settings proposed in [5] are /i = 9 ,/2 = l ,/3 = 
5,ni = 64,722 = 32 and 723 = 1, and the network 
can be denoted as 64(9)-32(l)-l(5). We follow [ 6 ] and 
enlarge the filter size of the second layer /2 to be 3, 
5 and 7. Then we have four networks - 64(9)-32(l)- 
1(5), 64(9)-32(3)-l(5), 64(9)-32(5)-l(5) and 64(9)-32(7)- 
1(5). Convergence curves on the validation set are shown 
in Figure 2. Obviously, using a larger filter size could 
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Fig. 2. Three-layer networks with different filter sizes. 


Fig. 4. Comparison between three-layer and the four-layer networks. 
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Fig. 3. Three-layer networks with different number of filters. 


Fig. 5. Four-layer networks with different filter sizes. 


significantly improve the performance. It is worth noticing 
that the difference between 64(9)-32(5)-l(5) and 64(9)- 
32(7)-1(5) is marginal, thus further enlarging the filter size 
will be of little help. 

2) Number of Filters: We then examine whether the 
performance benefits from more filters. For doing this, 
we fix the filter size to be /i = 9, /2 = 7, /s = 5 
and increase the number of filters to be ni = 128 
and 722 = 64, separately. The three networks are 64(9)- 
32(7)-l(5), 128(9)-32(7)-l(5) and 64(9)-64(7)-l(5). From 
convergence curves shown in Figure 3, we observe that 
the performance has been pushed much higher. However, 
when we take a look at the model complexity, 128(9)- 
32(7)-l(5) and 64(9)-64(7)-l(5) have 211,872 and 207,488 
parameters, which are almost double of that for 64(9)- 
32(7)-1(5) (106,336). Therefore, we wish to investigate 
other lower-cost alternatives that could still improve the 
performance. 

3) Number of Layers: A reasonable attempt to improve 
the performance with low cost is to make the network 
deeper. Here, we adopt a four-layer network structure by 
embedding another feature enhancement layer as in [4]. 
Further more, we gradually enlarge the filter size to get 
better results. Overall, we train 8 networks - 64(9)-32(7)- 
16(1)-1(5), 64(9)-32(7)-16(3)-l(5), 64(9)-32(7)-16(5)-l(5), 
64(9)-32(5)-16(5)-l(5), 64(ll)-32(9)-16(7)-l(5), 64(11)- 
32(9)-16(9)-l(5), 64(13)-32(11)-16(9)-1(5) and 64(15)- 
32(13)-16(11)-1(5). 

Figure 4 reveals the convergence curves of the four- 
layer network 64(9)-32(7)-16(1)-1(5) and the three-layer 
networks 128(9)-32(7)-l(5) and 64(9)-32(7)-l(5). Interest¬ 
ingly, the smallest four-layer network 64(9)-32(7)-16(l)- 
1(5) achieves similar performance as the largest three-layer 
network 128(9)-32(7)-l(5). Note that the network 64(9)- 


32(7)-16(1)-1(5) has only 106,448 parameters, which are 
roughly a half of that for 128(9)-32(7)-l(5) (211,872). This 
indicates that deeper networks could outperform shallow 
ones even with less parameters. 

When we enlarge the filter size of the first three layers, 
the performance improves and reaches a plateau at 64(11)- 
32(9)-16(9)-l(5)(see Figure 5). Further enlarging the filter 
size could make the network overfit to the training data (see 
64(13)-32(11)-16(9)-1(5) and 64(15)-32(13)-16(11)-1(5)). 
To avoid overfitting, a larger training set with more diverse 
images can be introduced, but we stick to the provided 
training set for competition. 

When we try much deeper networks (i.e. five layers), we 
find that the training is hard to converge even with smaller 
learning rates. This phenomenon is also observed in [6], 
where deeper networks are hard to train on generic images. 
We may try using other initialization strategies as in [4], 
but this is out of the scope of this paper. 

4) Initial values: Then we explore whether the results 
are affected by different initial values. We train three 
networks with the same structure 64(9)-32(7)-16(5)-1(5) 
but different initial values, which are randomly drawn 
from the same Gaussian distribution for three times. From 
Figure 6, we could see that the behaviours of three 
convergence curves are slightly different. This suggests 
that the results are not unique with different initial values. 
On the other hand, we could also combine the results of 
networks with the same structure but different initial values. 

B. Model Combination 

To conduct model combination, we select 11 four- 
layer networks with different filter sizes and initial val¬ 
ues. They are 64(9)-32(7)-16(l)-l(5), 64(9)-32(7)-16(3)- 
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Fig. 6. Four-layer networks with different initial values. 
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Fig. 7. PSNR values of different model combinations. 
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Fig. 8. Super-resolution results of different methods. 


1(5), 64(9)-32(7)-16(5)-1(5) (4 networks with differ¬ 
ent initial values), 64(9)-32(5)-16(5)-l(5), 64(ll)-32(9)- 
16(7)-1(5), 64(ll)-32(9)-16(9)-l(5), 64(13)-32(11)-16(9)- 
1(5) and 64(15)-32(13)-16(11)-1(5). Following the “Greedy 
Search” strategy, we successively obtain 14 best model 
combinations (from the best single model to the best 14- 
model combination) evaluated with PSNR. Note that we 
can also use the OCR score as evaluation criterion, then 
the results of model combinations could favour a high OCR 
score. 

The PSNR values^ of the selected 14 best model 
combinations are shown in Figure 7, from which we have 
two main observations. First, using model combination 
could significantly improve the performance, e.g. 0.53dB 
improvement from the best single model to the best 2- 
model combination. Second, the PSNR values are stable 
when combining 5 or more models. 

Figure 8 shows the super-resolution results of the best 
single model 64(9)-32(7)-16(5)-1(5) and the best 14-model 
combination. As can be observed, the super-resolved im¬ 
ages are very close to the ground truth HR images. 

C. Test Performance 

In the competition, we submitted two sets of results: 
one favouring the OCR accuracy score (SRCNN-1) and 
the other favouring the PSNR value (SRCNN-2). The 
performance is evaluated on the 141 test images with 
PSNR, RMSE, SSIM and OCR scores by the competition 

^Note that the PSNR values are not in accord with the convergence 
curves, since we remove 4-pixeI borders during training, but keep them 
during testing. 


TABLE I 

Results of the ICDAR2015 Competition on Text Image 
Super-Resolution . 


Method 

RMSE 

PSNR 

MSSIM 

OCR 

Bicubic 

19.04 

23.50 

0.879 

60.64 

LanczosS 

16.97 

24.65 

0.902 

64.36 

Orange Labs [2] 

11.27 

28.25 

0.953 

74.12 

Zeyde et al. [I I] 

13.05 

27.21 

0.941 

69.72 

A+ [9] 

10.03 

29.50 

0.966 

73.10 

Synchromedia Lab [ ] 

62.67 

12.66 

0.623 

65.93 

ASRS [10] 

12.86 

26.98 

0.950 

71.25 

SRCNN-I [5], [6] 

7.52 

31.75 

0.980 

77.19 

SRCNN-2 [5], [6] 

7.24 

31.99 

0.981 

76.10 


committee. The results are published in [3] and listed 
in Table I. The compared methods are briefiy described 
as follows. The bicubic and lanczos3 interpolation are 
the baseline methods, the Orange Labs [2] is the in¬ 
ternal approach used by the organizer. The Zeyde et 
al. [11] and A-f [9] are the representative state-of-the-art 
super-resolution methods. The Synchromedia Lab [ 7 ] and 
ASRS [10] are methods of the other two competition teams. 
Obviously, our method achieves the best performance 
both on OCR score (SRCNN-1) and the reconstruction 
errors (SRCNN-2). The SRCNN-1 improves the OCR 
performance by 16.55% in accuracy compared with bicubic 
interpolation. It is worth noticing that the OCR score 
of using the original HR images is 78.80%, which is 
only 1.61% higher than SRCNN-1. This indicates that 
our method is extremely effective in improving the OCR 
accuracy. 
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IV. Conclusions 

In this article, we summarize our attempt of adopting an 
SRCNN approach in the task of text image super-resolution 
for facilitating optical character recognition. Our method 
boosts the baseline OCR performance by a large margin 
(16.55%). As general image super-resolution has become 
an increasingly important problem in computer vision, we 
realize that more comprehensive studies on whether (or to 
what extend) super resolution further benefits other low- 
level vision tasks are needed. 
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